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Abstract 

Background: Statistics in ranked lists is useful in analysing molecular biology measurement data, such as 
differential expression, resulting in ranked lists of genes, or ChlP-Seq, which yields ranked lists of genomic 
sequences. State of the art methods study fixed motifs in ranked lists of sequences. More flexible models such as 
position weight matrix (PWM) motifs are more challenging in this context, partially because it is not clear how to 
avoid the use of arbitrary thresholds. 

Results: To assess the enrichment of a PWM motif in a ranked list we use a second ranking on the same set of 
elements induced by the PWM. Possible orders of one ranked list relative to another can be modelled as 
permutations. Due to sample space complexity, it is difficult to accurately characterize tail distributions in the group 
of permutations. In this paper we develop tight upper bounds on tail distributions of the size of the intersection of 
the top parts of two uniformly and independently drawn permutations. We further demonstrate advantages of this 
approach using our software implementation, mmHG-Finder, which is publicly available, to study PWM motifs in 
several datasets. In addition to validating known motifs, we found GC-rich strings to be enriched amongst the 
promoter sequences of long non-coding RNAs that are specifically expressed in thyroid and prostate tissue samples 
and observed a statistical association with tissue specific CpG hypo-methylation. 

Conclusions: We develop tight bounds that can be calculated in polynomial time. We demonstrate utility of 
mutual enrichment in motif search and assess performance for synthetic and biological datasets. We suggest that 
thyroid and prostate-specific long non-coding RNAs are regulated by transcription factors that bind GC-rich 
sequences, such as EGR1, SP1 and E2F3. We further suggest that this regulation is associated with DNA 
hypo-methylation. 

Keywords: Statistical enrichment, Position weight matrices, High-throughput sequencing data analysis, Tissue 
specific methylation patterns, IncRNA 



Background 

Modern data analysis often faces the task of extracting 
characteristic features from sets of elements singled out 
according to some measurement. In molecular biology, 
for example, an experiment may lead to measurement 
results pertaining to genes and then questions are asked 
about the properties of genes for which these were high 
or low. This is an example, of course, and the set of 
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elements does not have to be genes. They can be 
genomic regions, proteins, structures, etc. A central 
technique for addressing the analysis of characteristic 
properties of sets of elements is statistical enrichment. 
More specifically - the experiment results are often rep- 
resentable as ranked lists of elements and we then seek 
enrichment of other properties of these elements at the 
top or bottom of the ranked list. GSEA [1], for example, 
is a tool that addresses characteristic features of genes 
that are found to be differentially expressed in a com- 
parative transcriptomics study. GOrilla [2,3] addresses 
GO terms enriched in ranked lists of genes where the 
ranking can be, for example, the result of processing 
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differential expression data or of correlations computed 
between genomic DNA copy number and expression 
[4-6]. FATIGO [7] is also a tool that is useful in the con- 
text of analysing GO terms in ranked lists of genes. DRI- 
Must [8-10] searches for sequence motifs that are 
enriched, in a statistically significant manner, in the top 
of a ranked list of sequences, which can be produced by 
techniques like ChlP-Seq. 

All the aforementioned tools utilize a statistical ap- 
proach that is based on assessing enrichment of an input 
set in an input ranked list by quantifying the enrichment 
obtained at various cutoffs applied to the ranked list. It 
is often the case, however, that two quantitative proper- 
ties need to be compared to each other. For example, 
when the elements are genes, we may have measured 
differential expression values, as well as measured ChlP- 
Seq signals. We are therefore interested in assessing mu- 
tual enrichment in two ranked lists. Another example 
consists of one ranking according to differential expres- 
sion and one according to prediction scores for miRNA 
targets. miTEA [11,12] addresses this latter case by sta- 
tistically assessing the enrichment of miRNA targets in a 
ranked list of genes (also see [13]). To address mutual 
enrichment in two ranked lists over the same set of N 
elements, miTEA [11] performs analysis on permuta- 
tions. Mutual enrichment in the top of two ranked lists 
can be simplified, from a mathematical point of view, by 
arbitrarily setting the indices of one list to the identity 
permutation (1,2,..., AO and treating the other list as a 
permutation n = tt(1 ),..., n{ N) over these numbers. For 
the purpose of assessing the intersection of the top of 
the two ranked lists in a data driven manner, miTEA 
asks which prefix [1,...,^] is enriched in the first n 2 ele- 
ments of 7T, that is in the set 7r(l), n{n 2 ). The statistics 
introduced by miTEA is called mmHG (minimum-mini- 
mum- Hyper- Geometric). A slightly different variant of 
mmHG is described later in this section. 

Statistics in the group of permutations is often dif- 
ficult because the number of entities to be considered by 
any null model is M. Direct exhaustive calculation of tail 
distributions over S N is therefore impractical for all but 
very small values of N. This difficulty is addressed by 
several heuristic techniques. Mapping into continuous 
spaces, such as in [14], has proven useful in certain cases 
but not for studying large deviations. In the case of en- 
richment statistics that focuses on the top of the permu- 
tation and seeks to assess extreme events, such as 
mmHG, we prefer to use bounds on tail probabilities. 
Tail probabilities are useful constructs when applied to 
analysing molecular biology measurement data as they 
enable statistical assessment of observed results. 

In this work we derive tight bounds on the tail probabil- 
ities of mutual enrichment at the top of two random per- 
mutations uniformly drawn over S N and demonstrate the 



utility of this approach in the context of flexible motif dis- 
covery. Our bounds are computable in polynomial time 
and potentially add to the accuracy of reported position 
weight matrix (PWM) motifs for nucleic acid sequences. 

Mutual enrichment in ranked lists - the mmHG statistics 

The mmHG statistics [11] is a generalization of the mHG 
statistics [2,15-17]. The mHG statistics quantifies the en- 
richment level of a set of elements at the top of a ranked 
list of elements of the same type, whereas the mmHG 
statistics assesses the level of mutual enrichment in two 
ranked lists over the same set of elements. While any 
parametric or non-parametric correlation statistics (e.g. 
Spearman's correlation coefficient), that takes the same in- 
put, calculates the overall agreement between the two 
ranked lists, the mmHG statistic focuses only on agreement 
at the top of the two ranked lists. mmHG counts elements 
common to the top of both lists, without predefining what 
top is. Its intended output is the probability of observing an 
intersection at least as large in two randomly ranked lists 
(defined as the enrichment ^-value). In this section we de- 
scribe the mmHG statistics and in later sections we suggest 
tight bounds for the ^-value. Our definition of the mmHG 
statistics varies slightly from that of Steinfeld et al [11], 
which is used by miTEA. 

Mutual enrichment in the top of two ranked lists can 
be simplified, from a mathematical point of view, by ar- 
bitrarily setting the indices of one list to the identity per- 
mutation (1,2,..., N) and treating the other list as a 
permutation. Details of this transform are given in the 
next section. We now define mmHG for the simple 
case of one permutation. Consider a permutation tt = 
7t(1), ji(N) eS N - the group of all permutations over 
the numbers l,...,M mmHG is a function that takes n and 
calculates two numbers l<n 1) n 2 <N such that the ob- 
served intersection between the numbers and the 
first n 2 elements of n - namely, 7r(l), 7i{n 2 ) - is the most 
surprising in terms of the hypergeometric ^-value. 

Formally, given ttgS n and for every l < n lf n 2 < N, let 
W#i>«2) be the size of the intersection of l,...,#i with 
7i(l), ...,7r(« 2 ). Set 

mmHG score(n) 

where HGT is the tail distribution of an hypergeometric 
random variable: 

(n{\( N-ni \ 

min(«i,w 2 ) I . J I .1 
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The mmHG score cannot be considered as a signifi- 
cance measure, due to the multiple testing involved in 
finding n x and n 2 . A simple way to correct an mmHG 
score s for multiple testing and report an upper bound 
on the j^-value is to use the Bonferroni correction. Basically, 
s is multiplied by the number of multiple tests conducted 
(which is A/ 2 ), yielding an upper bound on the p-value, as 
follows: 

mmHG p-value(s 1 N)<s-N 2 

In the Results section we present significantly tighter 
bounds. 

Position weight matrix motifs 

Data produced by techniques such as ChlP-Seq [18], 
ChlP-exo [19], CLIP [20], PAR-CLIP [21] and others are 
readily representable as ranked lists of sequences, where 
the ranking is according to the measured binding affin- 
ity. Computational tools and approaches to motif discov- 
ery form part of the data analysis workflow that is used 
to extract knowledge and understanding from this type 
of studies. We are often interested in sequence motifs 
that are observed to be enriched in sequences where 
strong binding affinity is measured. A position weight 
matrix (PWM) is a commonly used representation of 
motifs in biological sequences [22-24]. This representa- 
tion is more faithful to the underlying biology than rep- 
resentation by exact words, owing to the tendency of 
binding sites to be short and degenerate [25]. A PWM is 
a matrix of score values that gives a weighted match to 
any given substring of fixed length. It has one row for 
each symbol in the alphabet, and one column for each 
position in the pattern. Assuming an input sequence of 
length equal to the PWM width, we simply multiply the 
scores assigned to each letter in each of the positions in 
the input sequence to obtain the likelihood of the input 
string (alternatively, we can sum the logarithms of the 
probabilities). That is, the score assigned by a PWM to a 

K 

substring S = S 1 ...S K is defined as where / repre- 

7=1 

sents a position in the substring; Sj is the symbol at pos- 
ition / in the substring; and p a> j is the score in row a, 
column j of the matrix. In other words, a PWM score is 
the product of position-specific scores for each symbol in 
the substring. This definition can be generalized to yield a 
score for a sequence S = Si...S M longer than the PWM by 

K 

calculating maxi<i<M-K+iY\ps i+ x j- Alternatively, an en- 

hanced model that takes into account multiple occur- 
rences of the PWM in the sequence can be applied by 
summing over sufficiently strong occurrences of the 
PWM or by other more sophisticated approaches [26]. 



mmHG statistics for PWM motifs 

Given a set of sequences that were tested in a high 
throughput experiment such as ChlP-Seq [18], CLIP [20] 
and others, they can be ranked according to the measured 
binding affinities, yielding a ranked list L\. Since usually 
we are interested in finding motifs amongst sequences 
having strong binding affinities, we actually search for mo- 
tifs that are more prevalent at the top of this list. It is clear 
that any algorithm for de-novo flexible motif search would 
need to evaluate candidate PWMs. Given a PWM which 
we want to assess, the sequences can also be ranked ac- 
cording to their PWM scores, yielding another ranked list 
L 2 , different from L\, A significant PWM motif would yield 
significant scores for sequences having strong binding af- 
finities. Therefore, the question of PWM motif discovery 
from ranked experimental data can be formulated as quan- 
tifying the mutual enrichment level for the two ranked lists 
Li and L 2 . Given two ranked lists L x and L 2 over the 
universe of N sequences, they can be transformed into two 
respective permutations, jti = (7T 1 (1 ),..., jti( A/)) and n 2 = 
(7r 2 (l), 7r 2 (A/)). The relative permutation of n 2 w.r.t. 
7Ti, is defined by 7i(jii(j)) = 7r 2 (/), for every ; = 1,...,A/, or sim- 
ply, using operations in the group S^: 7i = 7i 2 -7ii 1 . Using 
the relative permutation 7T, we can represent the mutual 
enrichment of the top parts of L x and L 2 as mmHG score 
{ri), defined above. 

Results 

Estimation of the mmHG p-value - introducing first upper 
bound - B1 

Given an mmHG score s, observed in analysing real 
measurement data, we would like to assess the statistical 
significance of this observation. Assuming endless com- 
putational power, we would enumerate all permutations 
and calculate the mmHG score for each, in order to 
characterize the distribution of mmHG as a random 
variable over S^. The j^-value for s is then simply: 

mmHG p-value(s,N) 

The number of permutations having mmHG score<s 
= AH 

Since the number of permutations is huge, the process 
described above is very far from feasible. Therefore, we seek 
a computationally tractable upper bound, preferably tight. 

A trivial upper bound is the Bonferroni corrected mmHG 
score defined by s • N 2 . A more subtle upper bound was 
suggested by Steinfeld et al [11] and is briefly described 
later as bound B3. In this work we introduce tighter bounds 
that are polynomially computable. 

We next describe an intuitive upper bound (Bl) which 
we later refine to produce a tighter bound (B2). The in- 
put of the problem consists of an mmHG score s and 
the total number of elements N. The output is an upper 
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bound on the p- value. The efficiency of our approach re- 
lies on enumerating all possible HGT scores rather than 
enumerating all permutations in S N . This approach is 
computationally efficient as HGT is a function of four 
input parameters: N, n h n 2i and b. Given N, there are 
0(N 3 ) possible combinations of n lt n 2 , and b. Also, given 
N, Hi and n 2 , b can be any integer in the range [max(0, 
n 2 -N + rii), min(w!, n 2 )]. Next, all is left to do is to de- 
termine how many permutations correspond to each 
HGT score. To this end, let A(N, n h b) be the num- 
ber of permutations for which it holds that b out of the 
first n 2 entries in the permutation are taken from the 
range This formulation is equivalent to count- 

ing permutations for which we attain, at some point, the 
value HGT (A/, n lt n 2 , b), had we taken the exhaustive 
approach. A(N, n h b) can be represented as: 



A(N,n u n 2 ,b) 



^ N ~"l )(n 2 -b)\(N-n 2 y. 



are 



as we first choose b elements from the range [1,. ..,#]] to 
appear at the first n 2 entries of the permutation (there 

^ I possibilities). Then, we choose the positions 

amongst the first n 2 entries that are occupied by these b 
elements, while considering all internal arrangements 

(for each choice of b elements there are ^ ^ 2 ^ b\ possi- 
bilities). We next choose n 2 -b elements from the range 
[«! + l,...,iV] to appear at the rest of the first n 2 entries 

of the permutation (there are ^ ^ n £ ^ possibilities for 

that) and consider all possible (n 2 - b) ! arrangements. 
Finally, we take into account all possible (N - n 2 ) ! 
arrangements of the remaining N-n 2 entries of the 
permutation. 

A straightforward upper bound for the number of per- 
mutations in S N having mmHG score better than s 
follows: 

\{n eS N : mmHG{n^j<s]\< ^ A(N, ni,n 2 , b) 

n\ ,ri2,b:HGT(N,ni ,«2 ,b)<s 

From which an upper bound is easily derived: 



E 



n\,n2,b:HGT{N ,n\,ri2,b)<s 



mmHG p-value{s 1 N)< 



By algebraic manipulations we get: 



A(N,n u n 2 ,b) 



N\ 



mmHG p-value(s,N)< 

n\ ,«2 } b:HGT(N,ni ,«2 ,b) <s 



n l\( N-Hi 

b J \ n 2 -b 

aT 



n 2 

This upper bound is simple and requires 0(A^ 3 ) HGT 
calculations. An HGT calculation takes Q(N) time, 



assuming binomial coefficients can be calculated in con- 
stant time. Constant time computation can be achieved 
using Stirling's approximation [27]: ^J2nn{^j n e^+i<n\< 
\/2nn(^) n eifc, which is tight for large factorials. 

A refined upper bound for the p-value - B2 

The upper bound introduced in the previous section 
counts the number of permutations for which the value 
HGT (A/, n lt n 2 , b) is calculated when taking the non- 
practical exhaustive approach that enumerates over all 
Nl permutations. Ideally, we wish to count the number 
of permutations for which the value HGT (A/, n lt n 2 , b) is 
also their mmHG score, as a permutation may corres- 
pond with many HGT values that are better than s, so it 
can be counted more than once. This explains why the 
formula introduced earlier is an upper bound and not an 
exact j?-value. A second observation that follows is that 
the smaller the mmHG score s, the tighter the bound, 
because a permutation will have fewer combinations (N, 
Hi, n 2 , b) having HGT values better than 5. 

Therefore, if we can reduce the extent of multiple 
counting of the same permutation, we will get a tighter 
bound. We do this by looking one step backwards. If, 
for example, HGT(A/, n lt n 2) b) < s, we can exclude from 
the counting permutations that contain b elements 
from the range at their first n 2 entries because 

they are already taken into account in A (A/, - 1, n 2 ,b) 
(because necessarily HGT (N, rii - 1, n 2 , b) < s, as we will 
later explain). 

Let i|/(A/, n lf n 2 , b) be the set of permutations for which 
it holds that b out of the first n 2 entries are taken from 
the range [l,..*,#i] (note that A{N 1 n 1) n 2l b) introduced 
earlier is, in fact, the size of \|/(AT, « x , n 2 , b)). Assuming 
HGT(Af, rii, n 2 , b) < s, we can partition the set i|/(Af, rii, 
n 2 ,b) into five disjoint subsets ^,...,^5 such that 
^=^ 1 u^ 2 u^ 3 u^ 4 u i// 5 , as follows: 

= i|/(iV, «i, « 2 , b)n\\r(N, n 2 -l, &-l)n\|/(7V, m-1, n 2 , b) 
\f/ 2 — \|/(JV, «i, n 2 , b)ny\f(Nj n 2 -l, b-l)n\\i(N, ni,n 2 -l, b) 
i// 3 = \|/(iV, «i, « 2 , b)n\\r(N, n 2 -l, b-l)n\\i(N, n x -\, n 2 , b-l) 

n\|/(iV, ni, n 2 -l, b-l) 
f 4 = \|/(JV, nx, n 2 , b)n\\j(N, n\-l, n 2 -l, b) 
i// 5 — \|/(JV, «i, n 2 , b)ny\f(Nj n 2 -l, b-2) 

n\\i(N, ni-1, n 2 , b-l)n\\j(N, m, n 2 -l, b-l) 

The properties of the hypergeometric distribution 
imply that given a tuple (N,rii,n 2 ,b), the permutations 
in i// lf i// 2 , can be disregarded from the current count- 
ing iteration. To explain why, we will demonstrate the 
argument on i// lt The permutations in ^ contain b ele- 
ments from the range at their first n 2 entries. 
Recall that we also assume that HGT (A/, n x ,n 2 ,b)< s. 
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Therefore HGT(A/, rii - 1, n 2i b)<s also holds, as the 
same intersection is observed for even a smaller set. 
Thus, the permutations in ^ have already been counted 
as having HGT value better than s when handling the 
triplet n 2 and and can be disregarded for the 

combination of n lf n 2 and b. Similar arguments hold for 
i// 2 and 

The permutations in \f/ 3 should be counted if three 
conditions hold: the first is HGT(A/, n Y - 1, n 2 - l,b - 1) 
> s; the second is HGT(A/, n x - 1, & - 1) > 5; and the 
third is HGT (A/, «i, « 2 - hb - 1) > 5. Otherwise, the per- 
mutations in i// 3 have been counted by former triplets. 
Similarly, the permutations in i// 5 should be counted if 
the following three conditions hold: HGT (N, rii-l,n 2 - 
1, b - 2) > s, HGT(7V, n x - 1, n 2l b-l)>s, and HGT(AT, n l9 
n 2 - 1, b - 1) > 5. Finally, we calculate the sizes of xj/ 3 and 
i// 5 , in the relevant cases. The definition of i// 3 implies 
that it consists of permutations that contain b-1 ele- 
ments taken from the range at their first n 2 -l 
entries, and also n x is positioned at entry n 2 . Therefore: 



l^ 3 l = 



Yl\-\ 
b-1 



n 2 -l 
b-1 



(^-l)!(^)(^)!(iV- W2 )! 



Equivalently, the permutations in if/ 5 contain b-2 ele- 
ments taken from the subset at their first n 2 -l 
entries; n x is positioned at one of the first n 2 -l entries; 
and entry n 2 contains an element from 
Therefore: 



x (« 2 -i)!(iii-i + l)(JSr-ii2)! 



From the above we next conclude an upper bound. 
Denote 

I(HGT(iV, Ml , M2 ,») = lfHGT{N ^^ 
And let A* (A/, «i, 6) = 

\xj/ 3 \ x I(HGT(AT,«i-l,« 2 -l,*-l) > s) 
xI(HGT(AT, « 2 , 6-1) > 5) 
xI(HGT(A/>i,«2-l,6-l) >s) 
+ 

|^ 5 | x I(HGT(AT, «i-l, « 2 -l, 6-2) > s) 
xI(HGT(AT, « 2 , 6-1) > 5) 
xI(HGT(A/>i,«2-l,6-l) >s) 

We can thus derive the following upper bound for the 
^-value: 



mmHG p-value(s,N)< 



Z — /Mi 



,n2-,b:HGT{N ,ni,ri2,b)<s 



A*(N,m,n 2 ,b) 



N\ 



Since A* is recursive, we need to define a base case. 
Recall that given A/, n x and # 2 , & can be any integer in 



the range [max(0, n 2 - N+ n^, min(#i, n 2 )], hence de- 
termining a base case for n\ and n 2 is sufficient (N is 
known). The base case here is that when n Y < 1 or n 2 < 1, 
A*(A/> «!, « 2 , b) is defined the same as A (A/, «i, « 2 , 6). 

This upper bound uses more delicate counting than the 
bound Bl introduced in the previous section. In the following 
sections we assess the tightness of this bound. In later sections 
we demonstrate an application for PWM motif search. 

Comparison to a different mmHG variant - B3 

We note that the bound described in Steinfeld et al [11] 
addresses a slightly different variant of mmHG as a ran- 
dom variable over S^. The definition with which we 
work here is more amenable to deriving tight bounds as 
described above. Given a single permutation ttgS n and 
for every i = 1,...,A/, a binary vector X t is defined in which 
exactly i entries are 1 and N-i entries are 0, as follows: 
Xi(j) = 1 iff n(j) < L The mmHG score of a permutation n 
is then defined by Steinfeld et al [11] as: 

mmHG(n) = mini<i<^P-value(mHG(Xi))<mini<i<^mHG(Xi)4 
Where mHG{X t ) = min Y < „ < N HGT(N, i, n, b n \ N = |A*| 

n 

and b n = ^^Xj(k). A possible upper bound is then given 



k=l 



by: 



(*) P-value(mmHG(n))<mini<i<^mHG(Xi)'i'N 

Computing the latter quantity requires 0(N 2 ) HGT 
calculations, and is therefore computationally more effi- 
cient than the two bounds Bl and B2 of this current 
work (that require 0(N 3 ) HGT calculations). We ob- 
served that our bound B2 was tighter than the bound in 
(*), as later shown in Figure ID. For example, for a per- 
mutation having mmHG score = 7.8T0" 25 (AT =100), our 
bound was 3.5T0 23 while (•) yielded 4.240" 21 . For one 
permutation with mmHG score = 5.1T0" 5 (N= 100), our 
bound was 0.026 while (*) yielded 0.2. The latter ex- 
ample demonstrates that a tighter bound is important 
for classifying an observation as statistically significant 
(assuming a significance threshold of 0.05). 

Assessment of tightness 

In order to assess the quality of our bound B2, we com- 
pared it to the j^-value, which can be calculated exactly 
for small values of N (that is, in cases where Nl is not 
too large) and empirically for larger values of N (by 
randomly sampling permutations). Evidently, our bound 
B2 was significantly tighter than the Bonferroni bound 
for N=10 (Figure 1A) and N=20 (Figure IB). We also 
observed that the smaller the mmHG scores - the 
tighter the bound, consistent with lesser over-counting 
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-log(mmHG score) -log(mmHG score) 

Figure 1 Assessment of tightness. (A) Four lines are shown for N= 10: the mmHG score, which also serves as a lower bound for the p-value; the 
exact p-value calculated by enumerating all 10! permutations; our refined upper bound B2; and the Bonferroni corrected p-value. (B) Here again the four 
lines are shown - for N = 20. However, instead of an exact p-value, which cannot be calculated exhaustively, an empirical p-value is produced by randomly 
sampling 10 7 permutations. (C) In addition to the four lines shown in B, the upper bound B1 is shown (A/ = 20). (D) Four lines are shown for N= 100: the 
mmHG score, our upper bound B2, the bound B3 and the Bonferroni corrected p-value. The exact p-value line is positioned between the green and the 
blue lines. An empirical p-value was not calculated here as even if we sample 10 7 permutations, a p-value smaller than 10~ 7 cannot be obtained. 



for smaller scores as explained in previous sections. 
Furthermore, our refined bound B2 is tighter than the 
bound Bl (Figure 1C), and the latter is significantly 
better than the Bonferroni bound. Both bounds Bl and B2 
are derived by enumerating HGT scores rather than enu- 
merating permutations in S^. The refinement of this ap- 
proach produced by reducing the extent of multiple 
counting of permutation further improves the upper 
bound. In addition, the bound B2 was almost always ob- 
served to be tighter than the bound B3 (Figure ID). 

An upper bound which balances between tightness and 
computational cost - B4 

The bound B2 is, evidently, very tight. It is, however, 
computationally heavy. We would still like to have an 
upper bound which is tighter than the Bonferroni 
bound and than the variant B3 but also faster to calcu- 
late. Such a compromise is achieved by generalizing an 



approach developed in [15] for the minimum hyper- 
geometric statistics. Namely, given the number of ele- 
ments N and an attainable mmHG score s for which 
we want to calculate the j^-value, for each l<b<N and 
for each l<ni<N, let n 2 (b, ni) be the maximal integer 
n 2 so that if in a permutation n e S N , b out of the first 
n 2 entries in n are taken from the range [l,...,^], then 
n satisfies HGT^Af, n 1} n 2f b) < s. Monotonicity prop- 
erties of the hyper-geometric distribution imply the 
existence of such n 2 integers. By definition, they are 
constants and independent of the original permuta- 
tion for which the mmHG score s was obtained. Due 
to monotonicity properties, given b and n Xi the max- 
imal value n 2 (b, rii) can be calculated efficiently using 
binary search, which means that an upper bound that 
requires 0(N 2 \ogN) calculations of HGT (instead of 
0(N 3 )) can be computed by using the following 
formula: 



mmHG p-value(s,N)< ^ 

b,ni,n2{b,ni):HGT(N,ni,ri2{b,ni),b)<s 



n\ \ ( N-ni 
\n 2 {b,m )-b 

N 

n 2 {b,ni) 
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The performance of this bound, as well as of other 
bounds (in terms of tightness and running time), is demon- 
strated in Table 1. On average, this bound was 16.5 times 
tighter than the Bonferroni bound; B3 was approximately 7 
times tighter than Bonferronis bound, while B2 was 38 
times tighter than Bonferronis, on average. The average 
computation time for B4 was 3 minutes, in comparison 
with 1 second for B3 and 26 minutes for B2. We conclude 
that the bound B4 presented in this section may be a 



good compromise between tightness and computa- 
tional cost compared with the other bounds introduced 
in this paper. 

Application in PWM motif search 

In this section we discuss mmHG as a framework for asses- 
sing the significance of PWM motifs in ranked lists. Given 
a ranked list of sequences and a PWM motif, by using the 
mmHG statistics and the bounds introduced earlier, we can 



Table 1 Performance of various bounds 



Protein 


N 


mmHG score 


Bound B2 


Bound B3 


Bound B4 


Bonferroni bound 


TNWMNG 


500 


2.17e-18 


2.75e-14 


7.5e-14 


6.28e-14 


5.43e-13 








0.274 min 


0.0028 min 


0.079 min 




CTNNNAT 


500 


2.86e-27 


1 .32e-28 


3.66e-28 


2.37e-28 


2.86e-27 








0.155 min 


0.0029 min 


0.059 min 




MMMMMMMM 


500 


1 .08e-43 


1 .07e-39 


3.47e-39 


1 .69e-39 


2.71 e-38 








0.104 min 


0.003 min 


0.048 min 




REB1 


4000 


1.66e-137 


9.18e-133 


1 .1 9e-1 31 


1.54e-132 


1.67e-131 








17.25 min 


0.04 min 


2.753 min 




CBF1 


4000 


1 .95e-80 


9.15e-76 


4.62e-75 


1 .84e-75 


1 .96e-74 








26.05 min 


0.03 min 


3.409 min 




UME6 


4000 


5.42e-88 


2.62e-83 


3.04e-82 


5.1 1e-83 


5.43e-82 








23.81 min 


0.03 min 


3.374 min 




JYE7 


4000 


1 .62e-43 


5.63e-39 


2.83e-38 


1 .39e-38 


1 .62e-37 








34.25 min 


0.02 min 


4.05 min 




GCN4 


4000 


2.04e-50 


7.66e-46 


4.62e-45 


1 .80e-45 


2.04e-44 








35.43 min 


0.03 min 


3.95 min 




Puf5 


4795 


7.91 e-85 


3.38e-80 


5.60e-79 


6.95e-80 


7.93e-79 








31.51 min 


0.027 min 


4.51 min 




Pub1 


4251 


1 .49e-84 


6.86e-80 


1 .33e-78 


1 .37e-79 


1 .5e-78 








27.74 min 


0.033 min 


3.81 min 




Pabl 


4142 


2.46e-11 


3.57e-7 


5.17e-7 


1 .37e-6 


2.46e-5 








48.46 min 


0.007 min 


5.41 min 




Khd1 


4773 


2.74e-20 


5.09e-16 


1.46e-14 


1.73e-15 


2.74e-14 








47.58 min 


0.015 min 


5.84 min 




Nab2 


4101 


2.09e-11 


3.08e-7 


1 .48e-5 


1.18e-6 


2.09e-5 








48.7 min 


0.016 min 


5.34 min 




Vts1 


1787 


1.44e-10 


4.74e-6 


1 .33e-5 


1 .4e-5 


1 .45e-4 








21.94 min 


0.003 min 


2.07 min 




Pin4 


4261 


8.16e-14 


1 .32e-9 


8.08e-9 


4.83e-9 


8.18e-8 








49.38 min 


0.011 min 


5.48 min 




Nrd1 


3947 


5.72e-12 


9.09e-8 


5.71 e-6 


3.36e-7 


5.74e-6 








47.67 min 


0.014 min 


5.1 1 min 




YII032C 


2286 


1 .06e-9 


2.62e-5 


1.61e-4 


8.3e-5 


0.001 








35.58 min 


0.003 min 


2.77 min 





Four bounds are compared over 17 datasets (3 synthetic and 14 biological). For each dataset, 
together with the performance of each bound (in terms of tightness and running time). 



the number of sequences (A/) and the mmHG score are indicated, 
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assign a ^-value to represent the significance of that PWM 
being enriched at the top of the list. To apply this approach 
for de-novo motif search, one needs to theoretically con- 
sider all possible PWMs. However, the search space - when 
considering position weight matrix motifs - is huge. As- 
suming the probabilities in the matrix are multiples of 0.1 
and the alphabet is of size 4, there are 286 /c possible candi- 
date PWMs of length k (since each column must sum to 1, 
the number of combinations in each column of the matrix 
is equal to the number of integer solutions for the equation 

Xi + X 2 + X 3 + X 4 = 10, which is ^ ^ )♦ O ur approach to 

navigating in this search space is to narrow the search using 
the IUPAC alphabet, which considers all possible combina- 
tions of letters in the alphabet, and then represent the motif 
as a PWM based on its actual occurrences in the list. This 
heuristic approach, called mmHG-Finder, takes as input a 
ranked list of DNA or RNA sequences and returns signifi- 
cant motifs in PWM format. In cases where sequence rank- 
ing is not relevant or not available, it allows the use of 
positive and negative sets of sequences, searching for 
enriched motifs in the positive set using the negative set as 
the background. 

We next describe the methodology implemented in 
mmHG-Finder. The input consists of a ranked list of se- 
quences (or, alternatively, two sets of sequences repre- 
senting target and background), as well as the motif 
width, given as a range [ki, k?\. 

The algorithm: 

1. Build a generalized suffix tree for the input 
sequences. 

2. For each k=k 1 ,..., k 2 

• Traverse the tree to find all /c-mers 

• Sort the /c-mers according to their enrichment at the 
top of the list (this is done using the mHG 
statistics), as explained in Leibovich et al [8]. 

• Take the most significant fifty /c-mers, to be used as 
starting points for the next step. This set of 
candidates is chosen such that the members are 
quite different. Note that this is a heuristic 
approach and the number 50 is somewhat 
arbitrary, chosen to succeed in catching the best 
performing PWMs without heavily paying in 
complexity. 

• For each starting point, we iteratively replace one 
position in the /c-mer by considering all possible 
IUPAC replacements and taking the one that 
improves the enrichment the most. We repeat this 
process for all positions several times, and eventually 
we get a motif in the IUPAC alphabet. We note that 
given an IUPAC pattern P, the occurrences of P in 
the list are extracted efficiently by traversing the 
paths in the suffix tree that agree with P. 



• Each IUPAC word is then expanded through a 
heuristic approach which is based on the Hamming 
neighbors of that word. Hamming neighbors are 
added as long as the new addition improves the 
enrichment ^-value of the set of words, and as long 
as the overall similarity between the members in the 
set does not decrease below a similarity threshold. 
Since the neighbors are defined as exact words, they 
usually help in fine-tuning the correct weights of 
each letter in each position of the resulting PWM. 
Finally, the expanded motif is converted to a PWM. 

3. The PWMs found in the previous step are assessed 
using the mmHG statistics and the best PWMs are 
returned as output, together with their ^-value. The 
score assigned by a PWM to a string S is the 
maximal score obtained for a substring of S. To 
obtain the likelihood of a substring of length k 
(where k is the PWM width), we simply multiply the 
scores assigned to each letter in each of the 
positions in that substring. 

We provide an efficient implementation of the algorithm 
described above as publicly available software. Our applica- 
tion takes as input a ranked list of sequences and returns 
significant PWM motifs. It is compatible with all operating 
systems and can be freely downloaded from http://bioinfo. 
cs.technion.ac.il/people/zohar/mmHG-Finder-code/. 

To evaluate the performance of mmHG-Finder in 
comparison to other state-of-the-art methods we ran it 
on 18 datasets - 3 synthetically generated datasets and 
15 generated from high throughput binding experiments 
(6 transcription factors and 9 RNA-binding proteins). 
Each synthetic dataset consisted of 500 randomly drawn 
sequences of length 100. Then, variants of a predefined 
IUPAC motif were planted at the top 64 sequences of 
the dataset. We compared the motifs found by mmHG- 
Finder to those obtained by using three other methods: 
the standard MEME program [28], DREME [29], and 
XXmotif [30]. Selected results of this comparison are 
summarized in Figure 2, and the full output is shown 
in Additional file 1. Evidently, mmHG-Finder outper- 
formed all the other three tools on the synthetic exam- 
ples, which contained degenerate motifs. DREME didn't 
find the motifs in any case, while MEME and XXmotif 
found a somewhat similar result in 1 out of the 3 tests. 
The other 15 examples were taken from DNA and RNA 
high-throughput experiments [31-33]. For 12 out of 
these 15 datasets, mmHG-Finder found the motifs which 
were compatible with the known literature motifs, and 
as the most significant result. In comparison, DREME 
found the known consensus in 11 cases; XXmotif de- 
tected the literature motif in 9 cases while MEME de- 
tected the known motif in 8 cases. In several datasets, 
such as for Pin4, mmHG-Finder identified the consensus 
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The protein and mniHG- 
its consensus Finder 
binding motif 



MEME 



DREME 



XXmotif 



Synthetic 

TNWMNG 

Synthetic 

CTNNNAT 

Synthetic 

MMMMMMMM 

P53 (DNA) 

GCN4 (DNA) 

TGAsTCa 

Puf5 (RNA) 

XliUGUAftHAJJ 



Pin4 (RNA) 

a ..ALGA 



6.28e-14 

■T ^ssc . G. 

2.37e-28 

CT AT 

1.69e-39 
1.09e-174 

C . T Z9 

1.8e-45 
T. JC 

6.95e-80 

iTGTA^ 



4.83e-9 



2.2e+6 



5.8e+7 



4.6e-7 



1.7e-16 



Nothing 
found 

Nothing 
found 

Nothing 
found 

4.9e-133 



2.0e-32 

6.8e-42 
3.1e-012 



4.2e+3 
J 



3.1e-51 



2.98e+00 
1.84e+01 

1.58e+01 

le-490 

4.00e-17 



9.76e-21 

1.61e-20 
g ; l T TAax^ tA 
1.65e-28 



Figure 2 Comparison between mmHG-Finder and other motif discovery tools. We evaluated the performance of mmHG-Finder in comparison 
to other state-of-the-art methods: MEME, DREME and XXmotif. Almost all input examples consisted of ranked lists, except for p53 (comprising target 
and background sets). Since MEME, DREME, and XXmotif expect to get a target set as input, we converted the ranked lists into target sets by taking 
the top 100 sequences for MEME (restricted by MEME's limitation of 60,000 characters) and the top 20% sequences for the other tools. In the synthetic 
examples the entire ranked lists were taken as they are sufficiently small (to reflect useful comparison with MEME, as the motif is planted in top 
sequences, we had provided MEME, as input, with the ranking information by adding weights to the sequences, decreasing from 1 to 0 proportionally 
with the ranking). We used the default parameters in all comparison to other tools (e.g. zero-or-one-occurrence per sequence in MEME) and defined 
the expected motif length as the range 6 to 8 where possible (specifically, DREME and XXmotif do not have an input parameter for the motif length). 
Data and consensus motifs for p53 were taken from [31]; for REB1, CBF1, UME6, TYE7, GCN4 from [32]; and for the RNA binding proteins from [33]. 
Selected results are shown. 



motif while other tools returned repetitive sequences as 
their top results. The mmHG statistics avoids such 
spurious results as they typically do not correlate with 
the measurement driven ranking. 

PWM motif search in long-non-coding RNA sequences 

We further analysed a collection of datasets comprising 
human long-non-coding (lnc) RNAs. LncRNA sequences 
were extracted and ranked according to the data reported 
by Cabili et al. [34]. Specifically, a stringent IncRNA set of 
4662 loci was tested, where for each locus we know the 
expression levels in 19 different tissues. From these data 
we generated 19 lists, each ranked according to tissue- 
specificity. Given locus i and tissue the tissue specificity 
score is defined as the difference between the expression 
of locus i in tissue ; (denoted expg) and the mean expres- 
sion of locus i (denoted as That said difference is mea- 
sured in terms of the standard deviation of expression in 
locus i (denoted as a t ). Formally: 



tissue specificity score^ = — 

Calculating the above measure for all tissues reported 
in [34] yielded 19 ranked lists comprising 4360 IncRNAs 
(302 loci having standard deviation equal to zero were 
removed from the analysis). We then conducted three 
enrichment tests for each of these lists: 

1. We searched for de-novo PWM motifs in the 
promoter sequences of the tissue-specific IncRNAs 
using mmHG-Finder (introduced in the previous 
section). Promoter sequences were defined as 
1000 bp upstream the transcription start site. 

2. We scanned the promoter sequences of the 
tissue-specific IncRNAs with PWMs corresponding 
to known transcription factors, downloaded from 
the JASPAR database [35]. 

3. Independently of sequence, we calculated the 
statistical enrichment of measured transcription 
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factor binding events within our lists of loci. 
Transcription factor binding events within IncRNAs 
were downloaded from ChlP-Base database [36], 
which aggregates high-throughput sequencing data 
taken from hundreds of ChlP-Seq experiments. 

Interestingly, almost all the motifs returned by mmHG- 
Finder were GC-rich (Figure 3). In all three tests, the most 
significant results were obtained for thyroid-specific and 
prostate-specific IncRNAs. We further checked whether 
GC rich sequences are generally enriched amongst the 



promoter sequences of tissue specific IncRNAs by calculat- 
ing the mutual enrichment between these two measures. 
The mutual enrichment between GC content and tissue 
specificity (Table 2) was the most significant for thyroid 
(mmHG /rvalue < 3.9T(T 31 ), prostate (5.8T(T 22 ), adrenal 
(5.5T0" 20 ), brain (1.6T0" 14 ) and ovary (8.8T(T 12 ). Interest- 
ingly, Pearsons correlation between the GC content and 
the sequence rank was not observed to be strong (strongest 
correlation coefficient was -0.1), demonstrating that the 
overall agreement between two measures can be weak even 
when extremities agree. 



Tissue 


mmHG-Finder output 


Transcription factors having 
similar recognition sites 


Thyroid 


6.48e-37 


E2F3, E2F2, Zfpl61, Zfx, SP1, 






Egrl,Bcl6b, Klf7, Sp4 


Prostate 


8.69e-23 


Bcl6b, Egrl, Smad3, SP1, Nr2f2, 




C c 


Zfp410, Mafb, Zfx, Zfp740 


Brain 


1.54e-18 


Ziplol, E2F3, TFAP2A, E2F2, 




! qC C c C 


Egrl,SPl,Myc, Sp4 


Ovary 


1 1 1o 1 A 

z.i le-io 


bgri, iNrziz, riagn, dciod, ^maoj, 






<\P1 7fv 7fn7AD 


Foreskin 


7.37e-16 


Zfx, Nr2f2, Egrl,SPl,Zfpl61, 




1 , l-O ^-G 


TFAP2A, Smad3, Bcl6b, SplOO, 






Zicl 


Kidney 


8.01e-ll 


Egrl,SPl,Sp4, Klf7, Zfp281, 




CC 


CTCF, INSMl,Zfp740 


Breast 


2.52e-10 


Egrl,Nr2f2, Zfx, TFAP2A, 






Zfpl61, Zicl, Plagll, Zic2, Tcfap2a 


Adipose 


2.79e-10 


Egrl,Tcfap2b, Plagll, SP1, 






NHLH1, INSM1, E2F2, Smad3, 
Sp4 


Adrenal 


3.78e-10 


E2F3, E2F2, Myf6, Nr2f2, Plagll, 




;cc.^ 


Sp4, Bcl6b, Smad3, CTCF, Zfpl61 


Lymph node 


2.01e-7 


Smad3, Sp4, Zfpl61, E2F2, E2F3, 






Glis2, INSM1, Egrl, SP1, Zfp740 


Testes 


9.55e-6 


Zfp410, Foxll,Gm397, Six2, 




t.ACAU 


Sox30 


Liver 


Nothing found 




Lung 


Nothing found 




White blood cell 


Nothing found 




Colon 


Nothing found 




Heart 


Nothing found 




Skeletal muscle 


Nothing found 




Placenta 


Nothing found 




Lung fibroblasts 


Nothing found 





Figure 3 Motifs in tissue-specific IncRNA promoter sequences. We analysed the promoter sequences of IncRNAs that are ranked according 
tissue-specificity. The motifs returned by mmHG-Finder are shown in the figure together with their p-value. We compared those motifs to 
known consensus motifs of transcription factors using TOMTOM [37] (motif database = JASPAR Vertebrates and UniPROBE Mouse) and the most 
significant results are shown (specifically, all similarity p-values are better than 0.018). 
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Table 2 CpG hypo-methylation in tissue-specific IncRNA promoters 


Tissue 


Mutual enrichment between 
GC content and tissue-specificity 


Mutual enrichment between hypo-methylation and tissue-specificity 


Normal/subnormal cells 


Cancer cells 


Thyroid 


3.89e-31 


No methylation data 


No methylation data 


Prostate 


5.76e-22 


4.16e-11 (PrEC) 


0.002 (LNCaP) 


Adrenal 


5.46e-20 


No methylation data 


No methylation data 


Brain 


1.57e-14 


1.21e-8 (NH-A) 


5.55e-5 (U87) 


Ovary 


8.80e-12 


No methylation data 


0.0085 (ovcar-3) 


Lymph node 


3.64e-6 


No methylation data 


No methylation data 


Adipose 


9.25e-6 


No methylation data 


No methylation data 


Foreskin 


2.25e-5 


0.72 (BJ) 


No methylation data 


Breast 


4.40e-5 


5.08e-5 (HMEC) 


8.45e-5 (MCF7) 






2.0e-10 (MCF10A) 


0.0065 (T-47D) 


Kidney 


6.34e-5 


1.56e-5 (HEK293) 


No methylation data 


White blood cell 


3.78e-4 


0.6 (GM 12878) 


0.21 (Jurkat) 


Placenta 


0.011 


No methylation data 


No methylation data 


Colon 


0.012 


No methylation data 


1 .0 (Caco-2) 


Skeletal muscle 


0.04 


0.34 (SKMC) 


No methylation data 


Lung 


0.33 


No methylation data 


No methylation data 


Heart 


1.0 


1 .0 (HCM) 


No methylation data 






1.0 (HCF) 




Liver 


1.0 


1 .0 (Hepatocytes) 


1.0 (HepG2) 


Testes 


1.0 


No methylation data 


1.0 (NT2-D1) 


Lung fibroblasts 


1.0 


1.0 (IMR90) 


No methylation data 






1 .0 (AGO4450) 





We calculated the mutual enrichment between DNA hypo-methylation and tissue specificity for the IncRNA promoters. CpG methylation data was taken from 
UCSC Table Browser [39] (ENCODE/HAIB). 



Furthermore, by intersecting the results of the sec- 
ond and the third tests together, we identified tran- 
scription factors that may regulate IncRNAs, mainly in 
thyroid and prostate. This set includes NRF1, E2F1, 
E2F3, E2F4, E2F6, EGR1, SP1, SP2 and ZBTB33. More- 
over, the consensus recognition sites of EGR1, SP1 and 
E2F3 were found to be similar to the motifs returned 
by mmHG-Finder in thyroid, prostate and other tissues 
(Figure 3; the comparison was done using the motif discov- 
ery tool TOMTOM [37]). The full output of the second 
and the third tests are summarized in Additional file 2. 

As GC-rich motifs may be associated with CpG methy- 
lation, and due to the possible binding of SP1 which has 
been suggested to protect CpG islands from de novo 
methylation [17,38], we further tested the association be- 
tween hypo-methylation and tissue specificity. For that, 
we downloaded genome wide (450 K) CpG Methylation 
data from UCSC Table Browser [39] (ENCODE/HAIB). 
We intersected IncRNA promoter regions with CpG 
methylation data, and continued only with the 1099 loci 
that were covered by the methylation experiment. For 



them, we calculated the mutual enrichment between 
hypo-methylation and tissue-specificity (the results are 
summarized in Table 2). Thyroid cells were not covered in 
this experiment, however two cell lines corresponding to 
prostate were tested (normal prostate epithelial cell line 
and cancerous prostate endodermal cell line). We ob- 
served that prostate-specific IncRNA promoters were less 
methylated than non-prostate-specific IncRNAs, and this 
was much stronger in normal cells than in cancer (mmHG 
^-value < 4.16e-ll in normal prostate cells, and 0.002 in 
prostate adenocarcinoma cells). We observed strong mu- 
tual enrichment between CpG hypo-methylation and 
tissue-specificity also in brain, ovary, breast and kidney. 
That is, significant mutual enrichment values were found 
for tissues where tissue-specific IncRNAs had GC-rich 
promoter sequences, but these values were not significant 
for tissues that did not show such GC-bias (heart, liver, 
testes, and lung fibroblasts). Additionally, in most cases 
the significance in normal cells was stronger than in can- 
cer, which may be related to changes in methylation pat- 
terns acquired during carcinogenesis [40,41]. 
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Discussion 

The assessment of mutual enrichment in ranked lists is 
often required to support the analysis of biological meas- 
urement data, such as in the case of identifying sequence 
motifs that are involved in regulation processes. Relative 
ranking can be represented by using permutations over 
the measured elements. Therefore - the statistical assess- 
ment of mutual enrichment can be modelled by character- 
izing properties of random permutations. Due to the size 
of the measure space, statistics over S N , the group of per- 
mutations over N elements, is difficult to perform and im- 
plement. Mutual enrichment is more informative from the 
point of view of practical biological science than simple 
correlation measures, as it focuses on the top of the lists 
and not on the overall agreement, which may be weak 
even in cases where extremities agree. In this work we de- 
rive polynomially computable bounds for the associated 
tail distribution of mutual enrichment in ranked lists. 
Namely - we provide methods for computing an upper 
bound on the ^-value of mutual enrichment at the top of 
two permutations uniformly and independently drawn 
over S N . Naive approaches to computing such bounds in- 
clude variants of the Bonferroni approach. These do not 
provide tight bounds and may lead to mis-labeling results 
as non-significant. For several representative datasets, 
we note that our bound improves the Bonferroni derived 
^-value estimates by a factor of almost 40, on average. 
Nevertheless, these improvements become relevant only 
for high ^-values - for which significant scores should 
be treated with care anyway. We therefore note that the 
Bonferroni correction is applicable in many cases, as dem- 
onstrated in Table 1. Using our bounds is highly beneficial 
in borderline cases but is also important in cases where an 
accurate estimate of the ^-value is desired, even if nuances 
do not affect the final biological research conclusions. 

We use our statistical/algorithmic framework to support 
PWM motif searches and demonstrate the application to 
biological data. We identify motifs that characterize tissue 
specificity of IncRNA in thyroid and in prostate. Specific- 
ally, we find the EGR1 binding motif to be enriched in 
the promoter regions of IncRNAs which are thyroid- 
specific. EGR1 was observed to be highly expressed in thy- 
roid (Additional file 1, taken from [36]), consistent with 
our stronger motif findings. Similarly, EGR1 is highly 
expressed in adipose tissue and its transcription factor 
binding sites are enriched in IncRNAs specific to this tis- 
sue. We do not have methylation data for the latter two 
tissues types. However - we do observe the promoters of 
IncRNAs that are specific to breast to have enriched oc- 
currences of motifs that are similar to EGR1 transcription 
factor binding sites (p-value of similarity according to 
TOMTOM = 3.52T0" 5 ). EGR1 is also highly expressed in 
breast. Finally, the promoters of IncRNAs that are specific 
to breast are less methylated in breast (MCF10A and 



HMEC cells) than other promoters. This suggests the role 
of EGR1 in driving tissue differentiation by transcribing 
tissue-specific IncRNAs and by protecting the associated 
promoters from methylation. EGR1 has been previously 
shown to recognize GC-rich consensus sequences located 
in CpG island promoters of active genes [42]. Generally, 
we observed that tissue-specific IncRNA promoters tend 
to be less methylated than those of non-tissue-specific 
IncRNAs in prostate, brain, ovary, breast and kidney, which 
may be associated with the GC-rich patterns enriched 
among their tissue-specific IncRNA promoter sequences. 

Threshold-free alternatives to mmHG include the work 
of McLeay and Bailey, in which a linear regression method 
is applied [43]. It was shown to achieve high accuracy on a 
benchmark comprising 237 ChlP-chip datasets, which 
was higher than all other data driven methods tested, and 
specifically higher than Spearman's rank correlation. We 
note that applying linear regression or Spearman correl- 
ation to PWM motif search in ranked lists requires that 
for significant motifs we observe an overall agreement be- 
tween the biological measurement and the PWM score. 
Nevertheless, the standard PWM formulation fails to pre- 
dict binding affinity when the latter decreases to the point 
of non-specific binding [44]. In other words, the overall 
agreement between the PWM score and the binding af- 
finity may be relatively weak. High correlation between 
the PWM score and the binding affinity needs to hold, 
in effect, only for sequences demonstrating high-binding 
affinity with respect to the protein of interest (that is, 
for sequences that are located at the top of the list) 
[45]. This weaker relationship is naturally addressed by 
the mmHG statistics. A combination of mmHG and a lin- 
ear model, such as suggested in [43], applied to strong 
binders (top of the list), may yield an even more faithful 
and informative model. 

Future research directions include more extensive 
application to biological data and the development of 
tighter and more efficient bounds. Our results show 
promise in enabling efficient and user-friendly PWM 
motif search in ranked lists. The software is freely avail 
able at http://bioinfo.cs.technion.ac.il/people/zohar/mmHG- 
Finder-code/. Finally, the full characterization of the 
distribution of mmHG as a random variable over S N re- 
mains an open question. 

Conclusions 

In this work we developed tight bounds on the tail distri- 
bution of mutual enrichment in ranked lists. Our bounds 
are computable in polynomial time and potentially add to 
the accuracy of reported results. We demonstrated the 
utility of mutual enrichment in motif search - specifically, 
when searching position weight matrix motifs in ranked 
lists, where the ranking can be according to binding affin- 
ity or according to any other biological measurement. 
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Additionally, we used mutual enrichment to study tissue- 
specific long non-coding RNA regulation, and suggest that 
tissue-specific IncRNAs are regulated through GC-rich el- 
ements located on their promoters, in several tissue types. 
We hypothesize that these GC-rich patterns are associated 
with DNA hypo-methylation. 
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